Slides 35,36,37 are the annotations for the Replacement Algorithms





# Computer Organization and Software Systems CONTACT SESSION 4

Prof. C R Sarma WILP.BITS-PILANI



# Today's Session

| Contact<br>Hour | List of Topic Title                                            | Text/Ref<br>Book/external |
|-----------------|----------------------------------------------------------------|---------------------------|
|                 |                                                                | resource                  |
| 7-8             | <ul> <li>Memory Organization (Contd)</li> </ul>                | T1, R1                    |
|                 | <ul> <li>Cache Memories (Contd)</li> </ul>                     |                           |
|                 | <ul> <li>Set Associative Caches</li> </ul>                     |                           |
|                 | <ul> <li>Issues with Writes</li> </ul>                         |                           |
|                 | <ul> <li>Performance Impact of Cache<br/>Parameters</li> </ul> |                           |
|                 | <ul> <li>Writing Cache friendly Codes</li> </ul>               |                           |
|                 | <ul> <li>Replacement Algorithms</li> </ul>                     |                           |

# Set Associative Mapping



Word

Set

m line Cache is divided into a number of sets (v sets each with k lines)

Tag

• m = v \* k

i = j modulo v

where i = cache set number

j = main memory block number

m = number of lines in the cache

- Each set contains 'k' number of lines
- A given block maps to any line in a given set
  - e.g. Block B can be in any line of set i
- e.g. 2 lines per set
  - 2 way set associative mapping
  - A given block can be in one of 2 lines in only one set

#### Example

- 16 Bytes main memory, Block Size is 2 Bytes,
- Cache of 8 Bytes, 2 way set associative cache
  - # address bits
  - Cache line size
  - # main memory blocks
  - # Number of cache lines
  - # lines per set
  - # of sets



| 0000   |            | Closel         |
|--------|------------|----------------|
| 0001   |            | <del>5</del> 6 |
| 000    |            |                |
| 0011   |            | P <sup>1</sup> |
| 0 100  |            |                |
| Olai   |            | ده             |
| 0) 1(0 |            | L_=            |
| 0111   |            | <b>U E</b>     |
| 1000   |            | 1              |
| 1001   |            | b <sub>4</sub> |
| loto   |            | 6-             |
| Lot    |            | -05            |
| [ 100  |            |                |
| 1101   |            | _ <u>-</u>     |
| 1110   |            | 67             |
| Ш      |            | ,              |
|        | Main Memon | t              |

# Example - Mapping Function

| i = j modulo v | Set # |
|----------------|-------|
| 0%2            |       |
| 1%2            |       |
| 2%2            |       |
| 3%2            |       |
| 4%2            |       |
| 5%2            |       |
| 6%2            |       |
| 7%2            |       |



|        |            | Cont     |
|--------|------------|----------|
| 2001   |            | lead     |
| 0 O Io |            | L        |
| 0011   |            | P1       |
| 00100  |            |          |
| Ola    |            | ده       |
| 0) 10  |            | 62       |
| 0111   |            | <b>U</b> |
| 1000   |            | _        |
| 1001   |            | b4       |
| loto   |            | 6        |
| Lott   |            |          |
| [ (00  |            | I        |
| 101    |            |          |
| 1110   |            | Ь-       |
| ПП     |            | _,       |
|        | Main Memon | Ł        |

TAG SET WORD

lead

**Direct Mapping link** 

Associative Mapping Link



#### Set Associative Mapping Summary

```
Address length = (s + w) bits

Number of addressable units = 2^{s+w} words or bytes

Block size = line size = 2^w words or bytes

Number of blocks in main memory = 2^d

Number of lines in set = k

Number of sets = v = 2^d

Number of lines in cache = kv = k * 2^d

Size of tag = (s - d) bits
```

#### Problem 1



White Board

- A set-associative cache consists of 64 lines, or slots, divided into four-line sets. Main memory contains 4K blocks of 128 bytes each. Show the format of main memory addresses. Find out
  - Total main memory capacity
  - Total cache memory capacity
  - Total number of sets in the cache
  - Number of bits for TAG, SET and word



#### Problem 2(Byte Addressable)

A 4-way set-associative cache memory unit with a capacity of 16 KB is built using a block size of 8 words. The word length is 32 bits. The siz of the physical address space is 4 GB. Find out address format.



#### Problem 2(Word Addressable

A 4-way set-associative cache memory unit with a capacity of 16 KB is built using a block size of 8 words. The word length is 32 bits. The size of the physical address space is 4 GB. Find out address format.

#### Direct mapped cache

- No choice
- Each block maps to one line and replace that line

#### Replacement Algorithms (2/3) invoked

- Needed in Associative & Set Associative mapped cache
- Hardware implemented algorithm (speed)
- Methods:
  - Least Recently Used (LRU)
  - Least Frequently Used (LFU)
  - First In First Out (FIFO)
  - Random

- Least Recently used (LRU): Replace the block in the set that has been in the cache longest with no reference to it
  - e.g. 2 way set associative
  - Uses "USE" bits
  - Most effective method
- Least frequently used: Replace block which has had fewest hits
  - Uses counter with each line
- First in first out (FIFO): Replace block that has been in cache longest
  - Round robin or circular buffer technique
- Random

#### Problem 3

Consider a reference pattern that accesses the sequence of blocks 0, 4, 0, 2, 1, 8, 0, 1, 2, 3, 0, 4. Assuming that the cache uses associative mapping, find the hit ratio with a cache of four lines

- a) LRU
- b) LFU
- c) FIFO

# Problem 2 - LRU



| Ref  | 0 | 4 | 0 | 2 | 1 | 8 | 0 | 1 | 2 | 3 | 0  | 4  |
|------|---|---|---|---|---|---|---|---|---|---|----|----|
| time | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| LO   |   |   |   |   |   |   |   |   |   |   |    |    |
| L1   |   |   |   |   |   |   |   |   |   |   |    |    |
| L2   |   |   |   |   |   |   |   |   |   |   |    |    |
| L3   |   |   |   |   |   |   |   |   |   |   |    |    |
| H/M  |   |   |   |   |   |   |   |   |   |   |    |    |



lead

#### Problem 2 - LFU

| Ref | 0 | 4 | 0 | 2 | 1 | 8 | 0 | 1 | 2 | 3 | 0 | 4 |
|-----|---|---|---|---|---|---|---|---|---|---|---|---|
| LO  |   |   |   |   |   |   |   |   |   |   |   |   |
| L1  |   |   |   |   |   |   |   |   |   |   |   |   |
| L2  |   |   |   |   |   |   |   |   |   |   |   |   |
| L3  |   |   |   |   |   |   |   |   |   |   |   |   |
| H/M |   |   |   |   |   |   |   |   |   |   |   |   |



#### Problem 2 - FIFO



| Ref  | 0 | 4 | 0 | 2 | 1 | 8 | 0 | 1 | 2 | 3 | 0  | 4  |
|------|---|---|---|---|---|---|---|---|---|---|----|----|
| time | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| LO   |   |   |   |   |   |   |   |   |   |   |    |    |
| L1   |   |   |   |   |   |   |   |   |   |   |    |    |
| L2   |   |   |   |   |   |   |   |   |   |   |    |    |
| L3   |   |   |   |   |   |   |   |   |   |   |    |    |
| H/M  |   |   |   |   |   |   |   |   |   |   |    |    |



#### Issues with Writes

- Multiple copies of data exist:
  - L1, L2, L3, Main Memory, Disk
- What to do on a write-hit?
  - Write-through (write immediately to memory)
  - Write-back (defer write to memory until replacement of line)
    - Need a dirty bit (line different from memory or not)



(b) Three-level cache organization

Cache and Main Memory

#### Issues with Writes

- What to do on a write-miss?
  - Write-allocate (load into cache, update line in cache)
    - Good if more writes to the location follow
  - No-write-allocate (writes straight to memory, does not load into cache)
- Typical
  - Write-through + No-write-allocate
  - Write-back + Write-allocate

#### Intel Core i7 Cache Hierarchy





| innovate | achieve | lead |
|----------|---------|------|
|          |         |      |

| Cache type       | Access time (cycles) | Cache size (C) | Assoc. $(E)$ | Block size $(B)$ | Sets (S) |
|------------------|----------------------|----------------|--------------|------------------|----------|
| L1 i-cache       | 4                    | 32 KB          | 8            | 64 B             | 64       |
| L1 d-cache       | 4                    | 32 KB          | 8            | 64 B             | 64       |
| L2 unified cache | 11                   | 256 KB         | 8            | 64 B             | 512      |
| L3 unified cache | 30-40                | 8 MB           | 16           | 64 B             | 8192     |

Characteristics of the Intel Core i7 cache hierarchy.

### Performance Impact of Cache Parameters

- Associativity:
  - higher associativity → more complex hardware
  - Higher Associativity → Lower miss rate
  - Higher Associativity → reduces average memory access time (AMAT)
- Cache Size
  - Larger the cache size → Lower miss rate
  - Larger the cache size → reduces average memory access time (AMAT)
- Block Size:
  - Smaller blocks do not take maximum advantage of spatial locality.



## Revisiting Locality of reference

```
int sumvec(int v[N])

int i, sum = 0;

for (i = 0; i < N; i++)

sum += v[i];

return sum;

}</pre>
```

Does this function have good locality?

| N=8          |    |    |     |    |    |    |    |    |  |
|--------------|----|----|-----|----|----|----|----|----|--|
| Address      | 0  | 1  | 2   | 3  | 4  | 5  | 6  | 7  |  |
| Contents     | v0 | v1 | v2  | v3 | v4 | v5 | v6 | V7 |  |
| Access Order | 1  | 2  | 3 / | 4  | 5  | 6  | 7  | 8  |  |

# Stride k - reference pattern

Byte Addressable memory and word length is 1 byte



Stride 2

| i  | Address |   |
|----|---------|---|
| 0  | 0000    | _ |
| 1  | 0001    | _ |
| 2  | 0002    |   |
| 3  | 0003    |   |
| 4  | 0004    |   |
| 5  | 0005    |   |
| 6  | 0006    |   |
| 7  | 0007    |   |
| 8  | 0008    |   |
| 9  | 0009    |   |
| 10 | 000A    |   |
| 11 | 000B    |   |
| 12 | 000C    |   |

Stride 1



|    | <del></del> |   |
|----|-------------|---|
| i  | Address     |   |
| 0  | 0000        |   |
| 1  | 0001        |   |
| 2  | 0002        | _ |
| 3  | 0003        |   |
| 4  | 0004        |   |
| 5  | 0005        |   |
| 6  | 0006        |   |
| 7  | 0007        |   |
| 8  | 0008        |   |
| 9  | 0009        |   |
| 10 | 000A        |   |
| 11 | 000B        |   |
| 12 | 000C        |   |

#### Stride k - reference pattern

innovate achieve lead

Byte Addressable memory and word length is 2 bytes

| i | Address |
|---|---------|
| 0 | 0000    |
| 1 | 0002    |

2 0004

3 0006

4 0008

5 000A

6 000C

7 000E

8 0010

9 0012

10 0014

11 0016

12 0018

#### Stride 1



| i  | Address |   |
|----|---------|---|
| 0  | 0000    | _ |
| 1  | 0002    |   |
| 2  | 0004    | _ |
| 3  | 0006    |   |
| 4  | 0008    | _ |
| 5  | 000A    |   |
| 6  | 000C    |   |
| 7  | 000E    |   |
| 8  | 0010    |   |
| 9  | 0012    |   |
| 10 | 0014    |   |
|    |         |   |

0016

0018

11

12

BITS Pilani, Pilani Campus

Stride 2

#### Revisiting Locality of reference

```
int sumarrayrows(int a[M][N])

int i, j, sum = 0;

for (i = 0; i < M; i++)

for (j = 0; j < N; j++)

sum += a[i][j];

return sum;

}
</pre>
```

Does this function have good locality?

| M = 2, N=3          |     |     |     |     |     |     |  |
|---------------------|-----|-----|-----|-----|-----|-----|--|
| Address 0 1 2 3 4 5 |     |     |     |     |     |     |  |
| Contents            | a00 | a01 | a02 | a10 | a11 | a12 |  |
| Access Order        | 1   | 2   | 3   | 4   | 5   | 6   |  |

#### Revisiting Locality of reference

Does this function have good locality?

| M = 2, N=3   |     |     |     |     |     |     |  |
|--------------|-----|-----|-----|-----|-----|-----|--|
| Address      | 0   | 1   | 2   | 3   | 4   | 5   |  |
| Contents     | a00 | a01 | a02 | a10 | a11 | a12 |  |
| Access Order | 1   | 3   | 5   | 2   | 4   | 6   |  |



#### Writing Cache Friendly Code

- Make the common case go fast
  - Focus on the inner loops of the core functions
- Minimize the misses in the inner loops
  - Repeated references to variables are good (temporal locality)
  - Stride-1 reference patterns are good (spatial locality)

#### Example 1

```
int sumarrayrows(int a[4][4])
int i, j, sum = 0;
for (i = 0; i < 4; i++)
  for (j = 0; j < 4; j++)
   sum += a[i][j];
return sum:
```

#### Assumption:

- The cache has a block size of 4 words each, 2 cache lines
- Word size 4 bytes.
- C stores arrays in row-major order

# Example 1(Contd..)

```
int sumarrayrows(int a[4][4])
{
  int i, j, sum = 0;
  for (i = 0; i < 4; i++)
     for (j = 0; j < 4; j++)
     sum += a[i][j];
  return sum;
}</pre>
```

| A[i][j] | J = 0 | J = 1 | J = 2 | J = 3 |
|---------|-------|-------|-------|-------|
| i = 0   |       |       |       |       |
| i = 1   |       |       |       |       |
| i = 2   |       |       |       |       |
| I = 3   |       |       |       |       |

#### Example 2:

```
int sum_array(int a[4][4])
{
  int i, j, sum = 0;
  for (j = 0; j < 4; j++)
    for (i = 0; i < 4; i++)
    sum += a[i][j];
  return sum;
}</pre>
```

#### Assumption:

- The cache has a block size of 4 words each, 2 cache lines
- Word size 4 bytes.
- C stores arrays in row-major order

# Example 1(Contd..)

```
int sum_array(int a[4][4])
int i, j, sum = 0;
for (j = 0; j < 4; j++)
 for (i = 0; i < 4; i++)
  sum += a[i][j];
return sum;
```

| A[i][j] | J = 0 | J = 1 | J = 2 | J = 3 |
|---------|-------|-------|-------|-------|
| i = 0   |       |       |       |       |
| i = 1   |       |       |       |       |
| i = 2   |       |       |       |       |
| I = 3   |       |       |       |       |

| a [6]      | ಬ₀               |  |
|------------|------------------|--|
| داو]ل      | ا <sub>ل</sub> ا |  |
| 1 CO 3 157 |                  |  |

| المِلاك الله | M 2 |
|--------------|-----|
| a [🗓 [3]     | ผฐ  |

| (ما [الم | Խ <sub>և</sub>    |
|----------|-------------------|
| a [i] [j | الما <del>د</del> |

# Home Work - Which one is better?



```
Program 1:

for (int i = 0; i < n; i++) {

   z[i] = x[i] - y[i];

   z[i] = z[i] * z[i];

}
```

```
Program 2:
for (int i = 0; i < n; i++) {
    z[i] = x[i] - y[i];
}
for (int i = 0; i < n; i++) {
    z[i] = z[i] * z[i];
}</pre>
```

# That's all For Now Folks

### Problem 2 - LRU



BIT



# Problem 2 - LFU

|       |   |              |     |                                      | /   |     | // | •   |     | /   |
|-------|---|--------------|-----|--------------------------------------|-----|-----|----|-----|-----|-----|
| Ref   | 0 | 4            | 0   | 2                                    | 1.  | 8   | 0  | 1   | 2.  | 3   |
| LO    |   | $\bigcirc$   | 0/2 | $\mathcal{L} \bigcirc_{\mathcal{V}}$ | On  | 02  | 03 | 031 | D3  | 03  |
| L1    |   | 4            | 4   | 4,                                   | (4) | \$1 | 81 | 81  | Sy  | 3,  |
| L2/   |   | -"           |     | 2                                    | 21  | 21  | 21 | 21  | 2,2 | 21  |
| L3 // |   |              | _   |                                      | 1,  | \ \ |    | 12  | 12  | \ L |
| H/M   | M | $\mathbb{N}$ | H   | M                                    | M   | M   | 1  | 1   | 11  | M   |
|       |   | -            | ()  |                                      | \   | 1 ( | 12 | -13 | 17  | (   |
|       | H | \ -          | 5   |                                      |     |     | _  | 0   | 4   |     |
|       | ١ | `            |     | $\setminus V$                        |     |     |    |     |     |     |

BIT





| Ref  | 0        | 4                   | Ó                 | 2              | 1   | 8   | 0    | 1   | 2           | 3          |
|------|----------|---------------------|-------------------|----------------|-----|-----|------|-----|-------------|------------|
| time | 0        | 1                   | 2                 | 3              | 4   | 5   | 6    | 7   | . 8         | 9          |
| LO   | 00       | $\bigcirc_{\delta}$ | $O_{\mathcal{O}}$ | $O_{\upsilon}$ | 500 | 85  | \$ S | 45  | 85          | 45         |
| L1   |          | 4.                  | 41                | 41             | 49  | 41  | 06   | 06  | 06          | 0          |
| L2   |          | )                   | 2                 | 23             | 23  | 23  | 23   | 23. | 22          | 3          |
| L3   |          | )                   | _                 |                | 14  | 14  | 14   | (h  | 14          | 1          |
| H/M  | M        | M                   | 14                | M              | M   | M   | VV   | 1 \ | \<br>\<br>\ | $\nabla V$ |
|      | , (<br>ノ |                     | 4                 |                | V   | V ( |      | 4   | 11          |            |
|      | 12R      |                     |                   | <b>*</b>       |     |     | b \- | D   | 3           | R          |
|      |          |                     |                   |                |     |     |      |     |             |            |

BIT

**Back** 



#### Associative Cache Organization

**Back** 

